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Abstract Functional data analysis involves data described by regular functions 
rather than by a finite number of real valued variables. While some robust data anal- 
ysis methods can be applied directly to the very high dimensional vectors obtained 
from a fine grid sampling of functional data, all methods benefit from a prior sim- 
plification of the functions that reduces the redundancy induced by the regularity. 
In this paper we propose to use a clustering approach that targets variables rather 
than individual to design a piecewise constant representation of a set of functions. 
The contiguity constraint induced by the functional nature of the variables allows a 
polynomial complexity algorithm to give the optimal solution. 



1 Introduction 

Functional data H131 appear in applications in which objects to analyse display some 
form of variability. In spectrometry, for instance, samples are described by spectra: 
each spectrum is a mapping from wavelengths to e.g., transmittanceQ. Time varying 
objects offer a more general example: when the characteristics of objects evolve 
through time, a loss free representation consists in describing these characteristics 
as functions that map time to real values. 

In practice, functional data are given as high dimensional vectors (e.g., more 
than 100 variables) obtained by sampling the functions on a fine grid. For smooth 
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functions (for instance in near infrared spectroscopy), this scheme leads to highly 
correlated variables. While many data analysis methods can be made robust to this 
type of problem (see, e.g., (6] for discriminant analysis), all methods benefit from a 
compression of the data [12] in which relevant and yet easy to interpret features are 
extracted from the raw functional data. 

There are well-known standard ways of extracting optimal features according 
to a given criterion. For instance in unsupervised problems, the first k principal 
components of a dataset give the best linear approximation of the original data in 
R* for the quadratic norm (see fPH for functional principal component analysis 
(PCA)). In regression problems, the partial least-squares approach extracts features 
with maximal correlation with a target variable (see also Sliced Inversion Regression 
methods 0). The main drawback of those approaches is that they extract features 
that are not easy to interpret: while the link between the original features and the 
new ones is linear, it is seldom sparse; an extracted feature generally depends on 
many original features. 

A different line of thoughts is followed in the present paper: the goal is to extract 
features that are easy to interpret in terms of the original variables. This is done by 
approximating the original functions by piecewise constant functions. We first recall 
in Section|2]the best basis problem in the context of functional data approximation. 
Section |3] shows how the problem can be recast in term of a constrained clustering 
problem for which efficient solutions are available. 



2 Best basis for functional data 

Let us consider n functional data, (s,)i<,<„. Each si is a function from [a,b] to R, 
where [a,b] is a fixed interval common to all functions (more precisely, s, belongs 
to L 2 ([a,b]), the set of square integrable functions on [a,b]). In terms of functional 
data, linear feature extraction consists in choosing for each feature a linear opera- 
tor from L 2 ([a 7 b\) to R. Equivalently, one can choose a function from L 2 ([a,b}) 
and compute (si,(j>) L 2 = J a <j>(x)si(x)dx. In an unsupervised context, using e.g., a 
quadratic error measure, choosing the k best features consists in finding k orthonor- 
mal functions ((j>i)\<i<k that minimise the following quantity: 
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The (<pj)i<j<k form an orthonormal basis of the subspace that they span: the optimal 
set of such functions is therefore called the best basis for the original set of functions 

(Si)l<i<n- 

If the fa are unconstrained, the best basis is given by functional PCA [1311 . How- 
ever, in order for the corresponding feature to be easy to interpret, the fa should 
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have compact supports, the simple case of fa = 1^ Vj j being the easiest to analyse 
(Ij M v j(x) = 1 whenx G [u,v] and elsewhere). 

The problem of choosing an optimal basis among a set of bases has been studied 
for some time in the wavelet community l3l [T5ll . In unsupervised context, the best 
basis is obtained by minimizing the entropy of the features (i.e., of the coordinates 
of the functions on the basis) in order to enable compression by discarding the less 
important features. Following lfl2l . lfl4l proposes a different approach, based on B- 
splines: a leave-one-out version of Equation (03 is used to select the best B-splines 
basis. While the orthonormal basis induced by the B-splines does not correspond to 
compactly supported functions, the dependency between a new feature and the orig- 
inal ones is still localized enough to allow easy interpretation. Nevertheless both 
approaches have some drawbacks. Wavelet based methods lead to compactly sup- 
ported basis functions but the basis has to be chosen in a tree structured set of bases. 
As a consequence, the support of a basis function cannot be any sub-interval of 
[a,b]. The B-spline approach suffers from a similar problem: the approximate sup- 
ports have all the same lengths leading either to a poor representation of some local 
details or to a large number of basis functions. 



3 Best basis via constrained clustering 

3.1 From best basis to constrained clustering 

The goal of the present paper is to select an optimal basis using only basis func- 
tions of the form I( U)V ), without restriction on the possible intervals among sub- 
interval of [a,£>jl Let us consider {fa- = v ,L u J (u-,v))i<j<k such an orthonormal ba- 
sis. We assume that the ((w/,v/))i<,-<£ form a partition of [a,b]. Obviously, we have 
(<j>j,Si) = v ._ u . Juj Si(x)dx, i.e., the feature corresponding to fa is the mean value of 

Si on [uj, vj\ . In other words, Y!j=i ( s i> 0t)z, 2 0* * s a piecewise constant approximation 
of Si (which is optimal according to the L 2 norm). 

In practice, functional data are sampled on a fine grid with support points a < t \ < 
... < t m < b, i.e., rather than observing the functions (s,)i< ( <„, one gets the vectors 
{si{tl))\<i<n,l<l<m from W n . Then (fa,Si) can be approximated by r^De/,- 
where L is the subset of indexes {l,...,m} such that ?/ G (uj,vj) / G Ij. Any par- 
tition of {{uj,Vj))i<j<k °f [ a ib] corresponds to a partition of {l,...,m} in k subsets 
(Ij)i<j<k that satisfies an ordering constraint: if r and s belong to Ij then any integer 
t G [r,s] belongs also to Ij. Finding the best basis means for instance minimizing the 
sum of squared errors given by Equation (Q~|i which can be approximated as follows 



2 The notations (u, v) is used to include all the possible cases of open and close boundaries for the 
considered intervals. 
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The second version of the error shows that it corresponds to an additive quality 
measure of the partition of {l,...,m} induced by the (Ij)i<j<k- Therefore, finding 
the best basis for the sampled functions is equivalent to finding an optimal partition 
of {1, ...,m} with some ordering constraints and according to an additive cost func- 
tion. A suboptimal solution to this problem, based on an ascending (agglomerative) 
hierarchical clustering, is proposed in J9J . 



3.2 Dynamic programming 

However, an optimal solution can be reached in a reasonable amount of time, as 
pointed out in ifTOl : when the quality criterion of a partition is additive and when a 
total ordering constraint is enforced, a dynamic programming approach leads to the 
optimal solution (this is a generalization of the algorithm proposed by Bellman for 
a single function in lfT6l l2l; see also IUE1 for rediscoveries/extensions of this early 
work). The algorithm is simple and proceeds iteratively by computing F(j,k) as the 
value of the quality measure (from Equation (O) of the best partition in k classes of 
{;,..., m}: 

1. initialization: set F(j, 1) to Q({j, ■ ■ ■ ,m}) for all j 

2. iterate from p = 2 to k: 

a. for all 1 < j < m — p + 1 compute 

F{j,p)=. min Q({j,...,l})+F(l + l,p-l) 

j<l<m-p+\ 

The minimizing index / = l(j,p) is kept for all j and p. This allows to reconstruct 
the best partition by backtracking from F(l,k): the first class of the partition is 
{1,...,Z(1,£)}, the second {1(1, k) + 1, . . . ,1(1(1, k) + l,k- 1)}, etc. A similar al- 
gorithm was used to find an optimal approximation of a single function in lEl fTD . 
Another related work is Q which provides simultaneously a functional clustering 
and a piecewise constant approximation of the prototype functions. 

The internal loop runs 0(km 2 ) times. It uses the values Q({j, . . . ,/}) for all j < I. 
Those quantities can be computed prior to the search for the optimal partition, using 
for instance a recursive variance computation formula, leading to a cost in 0(nm 2 ). 
More precisely, we are interested in 
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For a fixed function s,-, the Mjjj and Qijj are computed and stored in two mxm 
arrays, according to the following algorithm: 

1. initialisation: setMy ; =Si(tj) and Qij j =0forall j G {l,...,m} 

2. compute M^ij and Qi.ij for j > 1 recursively with: 

■V/.l.- j (■:./ ll.W,;.; ; ■ ,,;/;;) 

3. compute M,- / and Q; ; ; for / > y > 1 recursively with: 

M-.;,/ = / 2)M i> y_ , ,, - Sj (f;- 1 )) 

Gu,/ = G/j-iy - 7^7^^(0-1) -M U j) 2 

This algorithm is applied to each function leading to a total cost of 0(nm 2 ) with a 
0{m 2 ) storage. The full algorithm has therefore a complexity of 0((n + k)m 2 ). 



3.3 Extensions 

As pointed out in ifTol . the previous scheme can be used for any additive quality 
measure. It is therefore possible to use e.g., a piecewise linear approximation of the 
functions on a sub-interval rather than a constant approximation (this is the origi- 
nal problem studied in J2) for a single function). However, additivity is a stringent 
restriction. In the case of a piecewise linear approximation for instance, it prevents 
the introduction of continuity conditions: if one searches for the best continuous 
piecewise linear approximation of a function, then the optimized criterion is no 
more additive (this is in fact the case for all spline smoothing approaches expect 
the piecewise constant ones). 

In addition, for the general case of an arbitrary quality measure Q there might be 
no recursive formula for evaluating Q. In this case, the cost of computing the needed 
quantities might exceed 0(nm 2 ) and reach (9(nm 3 ) or more, depending on the exact 
definition of Q. 
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That said, the particular case of leave-one-out is quite interesting. Indeed when 
the studied functions are noisy, it is important to rely on a good estimate of the ap- 
proximation error to avoid overfitting the best basis to the noise. It is straightforward 
to show that the leave-one-out (l.o.o.) estimate of the total error from equation (O 
is given by 

when l.o.o. is done on the sampling points of the functions. This is an additive qual- 
ity measure which can be computed using from the <2< ; /, that is in an efficient recur- 
sive way. As shown above, the piecewise constant approximation with k segments 
is obtained via the computation of the best approximation for all / in {l,...,k}. It 
is then possible to choose the best / based on the leave-one-out error estimate at the 
same cost as the one needed to compute the best approximation for the maximal 
value of /. This leads to two variants of the algorithm. In the first one, the standard 
algorithm is applied to compute all the best bases and the best number of segments 
is chosen via the l.o.o. error estimate (which can be readily computed once the best 
basis is known). In the second one, we compute the best basis directly according 
to the l.o.o. error estimate, leveraging its additive structure. It is expected that this 
second solution will perform better in practice, as it constrains the best basis to be 
reasonable (see Section|4]for an experimental validation). For instance, it will never 
select an interval with only one point whereas this could be the case for the stan- 
dard solution. As a consequence, the standard solution will likely produce bases 
with rather bad leave-one-out performances and tend to select a too small number 
of segments (see Section|4]for an example of this behavior). 
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Fig. 1 Three spectra from the Wine dataset 
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4 Experiments 

We illustrate the algorithm on the Wine datase0 which consists in 124 spectra of 
wine samples recorded in the mid infrared range at 256 different wavenumberfl 
between 4000 and 400 cm -1 . Spectra number 34, 35 and 84 of the learning set of 
the original dataset have been removed as they are outliers. As shown on Figure [T] 
the function approximation problem is interesting as the smoothness of the spectrum 
varies along the spectral range and an optimal basis will obviously not consist in 
functions with supports of equal size. Figure [2] shows an example of the best basis 
obtained by the proposed approach for k = 16 clusters, while Figure [3] gives the 
suboptimal solution obtained by a basis with equal length intervals (as used in (14\ ). 
The uniform length approach is clearly unable to pick up details such as the peak 
on the right of the spectra. The total approximation error (equation (O) is reduced 
from 62.66 with the uniform approach to 7.74 with the optimal solution. On the 
same dataset, the greedy ascending hierarchical clustering approach proposed in 
JU reaches a total error of 8.55 for a similar running time of the optimal approach 
proposed in the present paper. 
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Fig. 2 Example of the optimal approximation results for 1 6 clusters on the Wine dataset 



4000 



To test the leave-one-out approach, we have first added a Gaussian noise with 
0.04 standard deviation (the functions take values in [—0.265,0.581]). Then we look 
for the best basis up to 64 segments. As expected, the total approximation error 

3 This dataset is provided by Prof. Marc Meurens, Universite catholique de Louvain, BNUT unit, 
and available at http : / / www . ucl . ac . be /mlg/ index . php?page=DataBases 

4 The wavenumber is the inverse of the wavelength. 
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Fig. 3 Example of the uniform approximation results for 16 clusters on the Wine dataset 

decreases with the number of segments and would therefore lead to a best basis 
with 64 segments. Moreover, as explained in the previous Section, the bases are 
not controlled by a l.o.o. error estimate. As a consequence, the optimization leads 
very quickly to basis with very small segments (starting at k = 12, there is at least 
one segment with only one sample point in it). Therefore, the l.o.o. error estimate 
applied to this set of bases selects a quite low number of segments, namely k = 1 1 . 
When the bases are optimized according to the l.o.o. error estimate, the behavior is 
more smooth in the sense that small segments are always avoided. The minimum 
value of the l.o.o. estimate leads to the selection of k = 20 segments. 



Basis 


Noisy data Real spectra 


k = 64 (standard approach) 


37.28 14.35 


k = 1 1 (l.o.o. after the standard approach) 


63.19 17.35 


k = 20 (full l.o.o.) 


54.07 12.07 



Table 1 Total squared errors for the Wine dataset with noise 

TableQ]summarizes the results by displaying the total approximation error on the 
noisy spectra and the total approximation error on the original spectra (the ground 
truth) for the three alternatives. The full l.o.o. approach leads clearly to the best 
results, as illustrated on Figures|4]and|5] 

Those experiments show that the proposed approach is flexible and provides an 
efficient way to get an optimal basis for a set of functional data. We are currently 
investigating supervised extensions of the approach following principles from Q. 
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Fig. 4 Best basis selected by leave-one-out with the standard approach combined with loo 
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Fig. 5 Best basis selected by leave-one-out with the full loo approach 
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